“Workshop: ggplot2: from scratch to compelling graphs”

the simple way of plotting in R

plot(price ~ carat, data=diamonds)
hist(diamonds$price)
boxplot(diamonds$price)

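using ggplot2 with its built-in diamonds dataset (the data ships with the ggplot2 package, not with RStudio). load the package with library(ggplot2), then head(diamonds) shows the first rows: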

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

set a plot’s aesthetics with aes()

map the diamond’s cut value to the color aesthetic with aes(color=cut)

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut))

or hard-code a fixed color by setting color='blue' outside aes(), so it is not mapped to a variable

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(color='blue')
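
the difference between mapping and setting can be checked directly. the sketch below is an illustration, not part of the workshop material: it uses ggplot2's layer_data() to look at the colors a layer actually ends up with. the names d, p_set and p_map are made up here, and the 1,000-row sample is only there to keep it quick.

```r
library(ggplot2)

# small random sample so the example runs quickly (assumption: any subset will do)
set.seed(1)
d <- diamonds[sample(nrow(diamonds), 1000), ]

# setting: the color is a constant given outside aes()
p_set <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# mapping by mistake: inside aes(), "blue" is treated as a data value,
# so it gets a color from the default scale (and a legend)
p_map <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(aes(color = "blue"))

# inspect the colors each layer will draw with
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
set_colours
map_colours
```

if map_colours is not "blue", the string was mapped through a scale rather than used as a literal color.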

histograms

geom_histogram automatically calculates the bar heights (counts), so no y value is necessary. change the resolution with the binwidth argument, which is a parameter of geom_histogram() rather than an aesthetic

ggplot(diamonds) + geom_histogram(aes(x=price), binwidth = 100)
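
to see what binwidth changes, the computed bars can be compared for two widths. this is a small sketch using layer_data(); the second width of 1000 and the variable names are arbitrary choices for illustration.

```r
library(ggplot2)

# same data, two bin widths (1000 is an arbitrary comparison value)
p_narrow <- ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 100)
p_wide   <- ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 1000)

# layer_data() returns one row per bar, including the computed count
bars_narrow <- layer_data(p_narrow)
bars_wide   <- layer_data(p_wide)

nrow(bars_narrow)        # many narrow bars
nrow(bars_wide)          # fewer, wider bars
sum(bars_narrow$count)   # every diamond is counted exactly once
```

a smaller binwidth gives more bars and more detail, while the total count stays the same.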

the color property sets the border on 2d objects, and fill sets the fill color: fill='red', color='blue'

ggplot(diamonds) + geom_histogram(aes(x=price), fill='red', color='blue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

compare the aesthetics of the following 2 examples:

when you specify aes() inside the ggplot() function, it applies to all layers. in this case color is mapped to the diamond’s cut in ggplot’s aes(), like so: ggplot(diamonds, aes(x=carat, y=price, color=cut)). geom_smooth() inherits the mapping and draws one curve per cut, and the aes(color=cut) repeated in geom_point() is redundant

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point(aes(color=cut)) + 
  geom_smooth()

while in this case, color is mapped to cut only in the geom_point() layer’s aesthetics, so geom_smooth() draws a single curve through all the data.

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point(aes(color=cut)) +
  geom_smooth()
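
the effect of where aes() is placed can be counted rather than eyeballed. the sketch below is an illustration with one deliberate change: it uses method = "lm" instead of the default smoother so that it runs quickly and needs no extra packages. the variable names are made up here.

```r
library(ggplot2)

# color mapped in ggplot(): every layer inherits it
p_global <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped only in geom_point(): the smoother does not see it
p_local <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  geom_smooth(method = "lm", formula = y ~ x)

# layer 2 is the smoother; each group becomes one fitted curve
n_curves_global <- length(unique(layer_data(p_global, 2)$group))
n_curves_local  <- length(unique(layer_data(p_local, 2)$group))
n_curves_global
n_curves_local
```

the first plot gets one curve per cut, the second a single curve for the whole dataset.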

transparency

transparency is achieved here by setting alpha=1/3 in geom_point()

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point(aes(color=cut), shape=1, size=2, alpha=1/3) +
  geom_smooth()

multiples aka facets

assign the base plot to a variable so it can be reused!

g <- ggplot(diamonds, aes(x=carat, y=price, color=cut))
g + geom_point() + facet_wrap( ~ cut)

1d row or col of facets

g + geom_point() + facet_wrap( ~ cut, nrow=1, scales='free') # also try ncol=1

2d grid of facets

g + geom_point() + facet_grid(color ~ clarity)

scales

the scales='free' argument allows the axis scales of individual facets to vary. scales='free_y' allows a free y axis with a fixed x axis (scales='free_x' does the reverse)

using free scales can be misleading: at a glance the panels suggest that all cuts of diamonds have the same price range, because each panel gets its own axis

to research: formula notation (use of ~)
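
as a starting point, here is a short sketch of how the formula is read by the facet functions used above. it assumes a reasonably recent ggplot2 (vars() was added in version 3.0), and the variable names are made up for illustration.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# one-sided formula: "facet by cut", panels wrap into rows and columns
p_wrap <- base + facet_wrap(~ cut)

# two-sided formula: left side becomes rows, right side becomes columns
p_grid <- base + facet_grid(color ~ clarity)

# a dot means "nothing on this side": rows only
p_rows <- base + facet_grid(cut ~ .)

# vars() is a formula-free way to write the same thing as ~ cut
p_vars <- base + facet_wrap(vars(cut))

# count the panels that actually receive data
n_panels_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_panels_vars <- length(unique(layer_data(p_vars)$PANEL))
n_panels_wrap
```

print any of the plot objects (for example p_grid) to draw it.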

compare the 2 following examples

ggplot(diamonds, aes(x=price)) + geom_histogram() + 
  facet_wrap(~cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(x=price)) + geom_histogram() + 
  facet_wrap(~cut, scales = 'free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplots…

ggplot(diamonds, aes(x=1, y=price)) + geom_boxplot()

ggplot(diamonds, aes(x=cut, y=price)) + geom_boxplot()

...vs violins (more informative!)

ggplot(diamonds, aes(x=cut, y=price)) + geom_violin()

more on violin plots…

when we add geom_point, the x axis is discrete, so all the points of a cut pile up on one vertical line, which is not very helpful

ggplot(diamonds, aes(x=cut, y=price)) + geom_violin() +
  geom_point()

using jittering spreads the points out horizontally, which makes them more useful for showing density. also, the order matters! put the violin on the top layer (last) so it stays visible over the points

ggplot(diamonds, aes(x=cut, y=price)) +
  geom_jitter(alpha=1/4) +
  geom_violin(alpha=1/2, draw_quantiles = .5) 

g1 <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(aes(color=cut))

style your plots with themes from the ggthemes package (load it with library(ggthemes)); theme_bw() comes with ggplot2 itself

g1 + theme_economist()

g1 + theme_fivethirtyeight()

g1 + theme_wsj() + scale_color_wsj()

g1 + theme_bw()

labels

g2 <-  ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point()

g2 + labs(x='Carat', y='Price ($)', title='Price by Carat')

alternatively

g2 + xlab('Carat') + ylab('Price ($)') + ggtitle('Price by Carat')

transformations

add dollar sign to labels

g2 + labs(x='Carat', y='Price', title='Price by Carat') + 
  scale_y_continuous(labels=scales::dollar)

comma-delimit numbers. note: the :: syntax lets you use a function from a package you haven’t loaded (the package must still be installed)

g2 + labs(x='Carat', y='Price ($)', title='Price by Carat') + 
  scale_y_continuous(labels=scales::comma)
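
the label functions are ordinary functions, so you can try them on a few numbers before putting them on an axis. a minimal sketch: the prices vector is made up for illustration (326 and 18823 are the smallest and largest prices in diamonds).

```r
# three example prices; no library() call needed thanks to ::
prices <- c(326, 2401, 18823)

dollar_labels <- scales::dollar(prices)
comma_labels  <- scales::comma(prices)

dollar_labels  # "$326" "$2,401" "$18,823"
comma_labels   # "326" "2,401" "18,823"
```

scale_y_continuous() calls the function you pass it on the axis breaks in the same way.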

scale_color_brewer() applies the ColorBrewer palettes; some of them (not all) are designed to be color-blind friendly (see the other scale_color_ functions)

g1 + scale_color_brewer()
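
which palettes are flagged as color-blind friendly can be looked up in the RColorBrewer package, which is installed alongside ggplot2 as a dependency of scales. this is a sketch: the choice of the "Dark2" palette and the variable names are just for illustration.

```r
library(ggplot2)

# brewer.pal.info lists every ColorBrewer palette with a colorblind flag
info <- RColorBrewer::brewer.pal.info
cb_safe <- rownames(info)[info$colorblind]
cb_safe

# rebuild g1 from the notes and pick a flagged palette by name
g1 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut))
p_dark2 <- g1 + scale_color_brewer(palette = "Dark2")
```

print p_dark2 to draw the plot with the chosen palette.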

move the legend to the bottom

g1 + theme(legend.position='bottom')

zooming

xlim zooms by removing the data outside the limits before any statistics are computed. this is problematic because the smoothing curve is then fit without those values, so the output is distorted

g3  <- g2 + geom_smooth()
g3 + xlim(c(0, 3))
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

so how can we just zoom?

g3 + coord_cartesian(xlim=c(1, 3))
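
a rough way to see why removing data distorts the smoother: the sketch below uses a plain linear fit with lm() as a stand-in for geom_smooth(), which is a simplification, and the names outside, fit_all and fit_trimmed are made up for this illustration.

```r
library(ggplot2)

# rows that xlim(c(0, 3)) would turn into missing values and drop
outside <- diamonds$carat < 0 | diamonds$carat > 3
n_dropped <- sum(outside)
n_dropped

# fit on all rows: what coord_cartesian() keeps, since it only changes the view
fit_all <- lm(price ~ carat, data = diamonds)

# fit on the trimmed rows: what a smoother sees after xlim()
fit_trimmed <- lm(price ~ carat, data = diamonds[!outside, ])

slope_all <- unname(coef(fit_all)["carat"])
slope_trimmed <- unname(coef(fit_trimmed)["carat"])
c(all = slope_all, trimmed = slope_trimmed)
```

the two slopes differ because the fits are based on different rows, even though the visible range of the plot would be the same.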

rotate 90 degrees

g3 + coord_flip()

polar coordinates

g3 + coord_polar()

an ugly heatmap

library(scales)
library(tidyr)
library(reshape2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# economic indicators dataset
head(economics)
## Source: local data frame [6 x 6]
## 
##         date   pce    pop psavert uempmed unemploy
##       (date) (dbl)  (int)   (dbl)   (dbl)    (int)
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018
#correlation matrix between these specific variables
econCor <- cor(economics[, c(2, 4:6)]) # this is a wide matrix; melt() below reshapes it to long format


# todo learn about piping in R with %>% (cmd+shift+m in RStudio)
econMelt <- melt(econCor, varnames=c('x', 'y'), value.name='Correlation')

head(econMelt)
##          x       y Correlation
## 1      pce     pce   1.0000000
## 2  psavert     pce  -0.8370690
## 3  uempmed     pce   0.7273492
## 4 unemploy     pce   0.6139997
## 5      pce psavert  -0.8370690
## 6  psavert psavert   1.0000000
econMelt <- econMelt %>%  arrange(Correlation)

head(econMelt)
##          x        y Correlation
## 1  psavert      pce  -0.8370690
## 2      pce  psavert  -0.8370690
## 3  uempmed  psavert  -0.3874159
## 4  psavert  uempmed  -0.3874159
## 5 unemploy  psavert  -0.3540073
## 6  psavert unemploy  -0.3540073
# set heatmap to h
h <- ggplot(econMelt, aes(x=x, y=y)) + geom_tile(aes(fill=Correlation))

#now style the heatmap

h + scale_fill_gradient2(low=muted('red'), mid='white', high='steelblue',
  guide=guide_colorbar(ticks=FALSE, barheight=10), limits=c(-1, 1)) + 
  theme_minimal() + labs(x=NULL, y=NULL) 

todo learn about dplyr
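
as a starting point for the piping and dplyr todos, the correlation table above can be rebuilt as a single pipeline. this is a sketch rather than the workshop's own code: it swaps reshape2's melt() for base R's as.table() and as.data.frame(), and the name econ_long is made up here.

```r
library(ggplot2)  # for the economics dataset
library(dplyr)    # for %>%, select(), rename(), arrange()

# each step passes its result to the next one via %>%
econ_long <- economics %>%
  select(pce, psavert, uempmed, unemploy) %>%       # same columns as c(2, 4:6)
  cor() %>%                                         # 4 x 4 correlation matrix
  as.table() %>%                                    # matrix -> table
  as.data.frame(responseName = "Correlation") %>%   # table -> long data frame
  rename(x = Var1, y = Var2) %>%                    # match the melt() column names
  arrange(Correlation)                              # most negative first

head(econ_long)
```

read the pipeline top to bottom: x %>% f(y) is the same as f(x, y).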